Add Presidio text anonymization scaffold by XxSURYANSHxX · Pull Request #233 · healthyinc/bio-block

XxSURYANSHxX · 2026-06-05T09:34:20Z

@pradeeban, @karthiksathishjeemain, @Chali-healthy. This PR starts the Week 2 work for my GSoC project:

Bio-Block: Advanced PHI Anonymization and Hybrid Data Retrieval Pipeline

The main goal of this PR is to replace the Week 1 text placeholder handler with a real clinical text anonymization pipeline.

In Week 1, the ingestion endpoint could detect file modality and route files to the correct handler, but all handlers were still placeholders. This PR makes only the text handler real.

After this PR:

Text uploads now use real anonymization.
CSV uploads still use the placeholder handler.
DICOM uploads still use the placeholder handler.
NIfTI uploads still use the placeholder handler.
WSI uploads still use the placeholder handler.
IPFS, encryption, indexing, and blockchain steps still remain pending.

This keeps the PR small, focused, and limited to the Week 2 text anonymization scope.

Why This PR Is Needed

The Week 2 proposal scope is:

Presidio integration
Custom clinical recognizers
Deterministic surrogate generation
Safe entity summaries
Integration with the Week 1 ingestion endpoint

Before this PR, the text ingestion flow only selected a placeholder handler. It did not actually anonymize clinical text.

This PR adds the first real anonymization path for uploaded text files while avoiding unrelated backend changes.

Main Changes

1. Added a Clinical Text Anonymization Service

A new service was added:

python_backend/services/text_anonymization.py

This service exposes a focused function:

anonymize_clinical_text(
    text: str,
    profile: str = "strict",
    study_salt: str | None = None,
) -> dict

The function:

Validates the input text.
Rejects empty text.
Runs the text through the Presidio analyzer.
Registers custom clinical recognizers.
Replaces detected PHI with safe surrogates or redaction tokens.
Returns anonymized text.
Returns only safe entity counts.
Does not return raw detected PHI values.

Example response shape:

{
  "anonymization_status": "completed",
  "anonymized_text": "Patient has MRN_A1B2C3D4 and email <REDACTED_EMAIL>.",
  "detected_entities": {
    "MEDICAL_RECORD_NUMBER": 1,
    "EMAIL_ADDRESS": 1
  }
}

Presidio Integration

This PR uses Microsoft Presidio for the text anonymization pipeline.

The new service uses Presidio's AnalyzerEngine with custom PatternRecognizer instances.

One important implementation detail is that the new Week 2 text service avoids hidden runtime model downloads.

Presidio's default setup can try to load or download a spaCy model if none is available. To avoid that, this PR uses a blank local spaCy tokenizer for pattern-based recognition. This keeps the new text ingestion path deterministic and safe for local development and CI.

No spaCy model download is required for this new Week 2 text ingestion path.

Custom Clinical Recognizers Added

This PR adds custom recognizers for clinical identifiers that general PII recognizers may miss.

The added clinical entity types are:

MEDICAL_RECORD_NUMBER
PATIENT_ID
HEALTH_PLAN_ID
ACCESSION_NUMBER
DEVICE_ID

The recognizers are designed to be conservative.

For example, this should be detected:

MRN: 123456

But this should not be blindly treated as an MRN:

The room number 123456 was cleaned.

This is important because clinical notes often contain many numbers that are not identifiers.

Medical Record Number Recognition

The MRN recognizer supports clinical context such as:

MRN
medical record
medical record number
hospital number
chart number

Supported examples include:

MRN: 123456
MRN 123456
Medical Record Number - 123456

The recognizer is case-insensitive, so both of these are supported:

MRN: 123456
mrn: 123456
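As a rough illustration of context-anchored, case-insensitive matching, here is a minimal standard-library sketch. It is not the PR's Presidio recognizer: the name `find_mrns`, the 4 to 12 digit bound, and the accepted separators are assumptions made for this example.

```python
import re
from typing import List

# Context keywords are taken from the list above. The digit-length bound and
# the separator set (":", "#", "-") are illustrative assumptions.
MRN_PATTERN = re.compile(
    r"\b(?:MRN|medical\s+record(?:\s+number)?|hospital\s+number|chart\s+number)"
    r"\s*[:#-]?\s*"
    r"(?P<value>\d{4,12})\b",
    re.IGNORECASE,
)


def find_mrns(text: str) -> List[str]:
    """Return digit runs that directly follow a clinical context keyword."""
    return [match.group("value") for match in MRN_PATTERN.finditer(text)]


# A bare number with no clinical keyword in front of it is left alone.
with_context = find_mrns("Medical Record Number - 123456")
without_context = find_mrns("The room number 123456 was cleaned.")
```

Requiring the keyword immediately before the digits is what keeps unrelated numbers, such as room numbers, out of the results.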

Patient ID Recognition

The Patient ID recognizer supports context such as:

patient id
patient number
patient identifier
pt id

Example:

Patient ID PT-1001 was admitted.

The raw ID is replaced with a deterministic surrogate like:

PATIENT_ID_XXXXXXXX

Health Plan / Insurance ID Recognition

The health plan recognizer supports context such as:

health plan
beneficiary
insurance
policy
member id
subscriber id

Example:

Insurance ID ABC123456789 was verified.

The raw value is replaced with a deterministic surrogate like:

HEALTH_PLAN_XXXXXXXX

Additional Clinical Recognizers

This PR also adds recognizers for:

Accession Number

Example contexts:

accession
accession number
acc no
accession no

Replacement format:

ACCESSION_XXXXXXXX

Device ID / Serial Number

Example contexts:

device
device id
serial
serial number
implant
equipment

Replacement format:

DEVICE_XXXXXXXX
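The keyword lists above lend themselves to a table-driven layout. The sketch below shows one way that could look using only the standard library. It does not use Presidio, and `CLINICAL_CONTEXTS`, `build_pattern`, `detect_clinical_ids`, and the value pattern (4 to 20 characters containing at least one digit) are hypothetical names and choices for illustration.

```python
import re
from typing import Dict, List, Tuple

# Keywords copied from the context lists in this description.
CLINICAL_CONTEXTS: Dict[str, List[str]] = {
    "PATIENT_ID": ["patient id", "patient number", "patient identifier", "pt id"],
    "HEALTH_PLAN_ID": ["health plan", "beneficiary", "insurance", "policy",
                       "member id", "subscriber id"],
    "ACCESSION_NUMBER": ["accession", "accession number", "acc no", "accession no"],
    "DEVICE_ID": ["device", "device id", "serial", "serial number", "implant",
                  "equipment"],
}

# Assumption: an identifier is 4-20 letters, digits, or hyphens with at least one digit.
VALUE = r"(?P<value>(?=[A-Za-z0-9-]*\d)[A-Za-z0-9][A-Za-z0-9-]{3,19})\b"


def build_pattern(keywords: List[str]) -> "re.Pattern":
    # Longest keywords first, so "serial number" is tried before "serial".
    alternatives = "|".join(
        r"\s+".join(re.escape(word) for word in keyword.split())
        for keyword in sorted(keywords, key=len, reverse=True)
    )
    # Allow an optional "id" / "number" / "no" after the keyword, then a separator.
    return re.compile(
        rf"\b(?:{alternatives})(?:\s+(?:id|number|no))?\s*[:#.-]?\s*{VALUE}",
        re.IGNORECASE,
    )


PATTERNS = {etype: build_pattern(words) for etype, words in CLINICAL_CONTEXTS.items()}


def detect_clinical_ids(text: str) -> List[Tuple[str, str, int, int]]:
    """Return (entity_type, value, start, end) tuples ordered by position."""
    hits = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((entity_type, match.group("value"),
                         match.start("value"), match.end("value")))
    return sorted(hits, key=lambda hit: hit[2])
```

The raw value appears in the tuple only so the caller can replace it; it would need to be dropped before anything is returned from the API.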

Deterministic Surrogate Generation

This PR adds deterministic surrogate generation using salted SHA-256 hashes.

The behavior is:

Same original value plus same salt gives the same surrogate.
Same original value plus different salt gives a different surrogate.
Different original values give different surrogates.
The surrogate does not contain the original value.
No reversible mapping is stored.
Raw PHI values are not returned in the API response.

Examples:

MRN: 123456

becomes something like:

MRN_A1B2C3D4

Patient ID PT-1001

becomes something like:

PATIENT_ID_9F8E7D6C

Insurance ID ABC123456789

becomes something like:

HEALTH_PLAN_12AB34CD

The implementation uses SHA-256 from the Python standard library.
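A minimal sketch of this kind of surrogate derivation follows. The function name `make_surrogate`, the `salt:value` concatenation, and the truncation to 8 uppercase hex characters are assumptions inferred from the example outputs above, not the PR's confirmed implementation.

```python
import hashlib


def make_surrogate(prefix: str, value: str, salt: str, length: int = 8) -> str:
    """Derive a stable token from a salted SHA-256 digest of the raw value."""
    # Assumption: salt and value are joined with ":" before hashing.
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()
    # Only a short prefix of the digest is kept; nothing is stored for reversal.
    return f"{prefix}_{digest[:length].upper()}"


same_a = make_surrogate("MRN", "123456", "study-salt-a")
same_b = make_surrogate("MRN", "123456", "study-salt-a")
other_salt = make_surrogate("MRN", "123456", "study-salt-b")
```

Note that truncating to 8 hex characters keeps only 32 bits of the digest, so two different values can in principle map to the same surrogate; a longer `length` reduces that risk.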

Common PHI Handling

This PR also handles several common PHI patterns.

Email Addresses

Email addresses are replaced with:

<REDACTED_EMAIL>

Example:

john.doe@example.com

becomes:

<REDACTED_EMAIL>

Phone Numbers

Phone numbers are replaced with:

<REDACTED_PHONE>

Example:

555-123-4567

becomes:

<REDACTED_PHONE>

SSNs

SSN-like values are replaced with:

<REDACTED_SSN>

Dates

Common date patterns are currently replaced with:

<REDACTED_DATE>

Date shifting is not implemented in this first Week 2 slice. This PR keeps date handling simple and safe by redacting common date formats for now.
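The redaction behaviour described in this section can be sketched with plain regular expressions. The patterns below are simplified assumptions (US-style phone and SSN formats, two common date layouts) and are not the patterns used in the PR; `redact_common_phi` is a hypothetical name.

```python
import re

# Rules are applied in order. SSNs and dates run before phone numbers so that
# each digit group is claimed by the most specific pattern first.
REDACTION_RULES = [
    ("<REDACTED_EMAIL>", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("<REDACTED_SSN>", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("<REDACTED_DATE>", re.compile(r"\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4})\b")),
    ("<REDACTED_PHONE>", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
]


def redact_common_phi(text: str) -> str:
    """Replace emails, SSN-like values, common dates, and phone numbers with tokens."""
    for token, pattern in REDACTION_RULES:
        text = pattern.sub(token, text)
    return text


redacted = redact_common_phi("Email john.doe@example.com or call 555-123-4567.")
```

Because the tokens contain no digits or `@`, a value that has already been redacted cannot be matched again by a later rule.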

Ingestion Endpoint Integration

The Week 1 ingestion endpoint is:

POST /api/v1/ingest

This PR updates the ingestion flow so that text files now go through the real anonymization service.

For text uploads, the endpoint now:

Detects the uploaded file as text.
Reads the text content safely.
Applies a 256 KiB text upload limit.
Decodes the file as UTF-8.
Rejects unsupported encoding clearly.
Calls the clinical text anonymization service.
Returns anonymization_status: "completed".
Returns anonymized text.
Returns safe entity counts only.
Keeps downstream pipeline steps as pending.

For non-text uploads, the endpoint still returns the Week 1 placeholder response.

Response Safety

This PR is careful about not exposing raw PHI.

The API response may include:

Filename
Detected modality
Privacy profile
Handler name
Routing status
Anonymization status
Anonymized text
Safe detected entity summary
Downstream pending states

The API response does not include:

Raw uploaded text separately
Raw detected PHI values
Raw entity examples
Debug traces
Internal stack traces
Reversible mappings

The entity summary only includes entity types and counts.

Safe example:

{
  "MEDICAL_RECORD_NUMBER": 1,
  "EMAIL_ADDRESS": 1
}

Unsafe example not used by this PR:

{
  "detected_values": ["123456", "john.doe@example.com"]
}
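One way to make the safe shape hard to get wrong is to build the summary from records that never carry the matched text. The sketch below assumes a hypothetical `Detection` record and `summarize_entities` helper; neither name comes from the PR.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class Detection(NamedTuple):
    """Hypothetical detection record: type and span only, never the raw value."""
    entity_type: str
    start: int
    end: int


def summarize_entities(detections: Iterable[Detection]) -> Dict[str, int]:
    """Collapse detections into counts per entity type."""
    return dict(Counter(d.entity_type for d in detections))


summary = summarize_entities([
    Detection("MEDICAL_RECORD_NUMBER", 12, 18),
    Detection("EMAIL_ADDRESS", 29, 49),
])
```

Since `Detection` has no field for the matched string, the summary cannot leak raw PHI even if the record is logged by mistake.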

Text Upload Size Guard

This PR adds a text upload size guard:

256 KiB

If a text upload is larger than the limit, the endpoint returns a clear error.

This avoids accidentally reading very large text files into memory during this first implementation slice.

Streaming or chunked anonymization is not implemented yet.

UTF-8 Handling

Text files are decoded as UTF-8.

If the uploaded text file is not valid UTF-8, the endpoint returns a clear error instead of failing with an internal exception.

Example error:

Text uploads must be UTF-8 encoded
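The size guard and the UTF-8 check can be sketched together as follows. `TextUploadError`, `read_text_upload`, and the wording of the size error are assumptions for illustration; only the 256 KiB limit and the UTF-8 error message come from this description.

```python
import io
from typing import BinaryIO

MAX_TEXT_UPLOAD_BYTES = 256 * 1024  # 256 KiB, as stated above


class TextUploadError(ValueError):
    """Hypothetical error type that the endpoint would map to a client error."""


def read_text_upload(stream: BinaryIO) -> str:
    # Read at most one byte past the limit, so an oversized file is detected
    # without pulling the whole upload into memory.
    raw = stream.read(MAX_TEXT_UPLOAD_BYTES + 1)
    if len(raw) > MAX_TEXT_UPLOAD_BYTES:
        raise TextUploadError("Text upload exceeds the 256 KiB limit")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Hide the decoder details; the caller only needs the clear message.
        raise TextUploadError("Text uploads must be UTF-8 encoded") from None


text = read_text_upload(io.BytesIO(b"MRN: 123456"))
```

Raising one dedicated error type for both cases lets the endpoint turn it into a clear response instead of an internal exception.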

What Was Intentionally Not Changed

This PR does not implement any non-text anonymization work.

Not included:

DICOM metadata scrubbing
NIfTI metadata scrubbing
OCR pixel redaction
WSI tiling
CSV k-anonymity
l-diversity
t-closeness
BM25
ChromaDB retrieval changes
Semantic search
RRF/MMR
IPFS integration
CID encryption
Blockchain transaction flow

This PR also does not touch:

/store
/store_enhanced
Existing search endpoints
ChromaDB collections
Blockchain/IPFS code
Unrelated preview logic

The goal was to keep this PR focused only on Week 2 text anonymization.

Files Changed

Added

python_backend/services/text_anonymization.py
python_backend/tests/test_text_anonymization.py

Updated

python_backend/services/ingestion.py
python_backend/main.py
python_backend/tests/test_ingestion.py

Tests Added

A new service-level test file was added:

python_backend/tests/test_text_anonymization.py

The ingestion test file was also updated:

python_backend/tests/test_ingestion.py

The tests cover:

MRN detection with context
Avoiding MRN false positives for random numbers
Case-insensitive MRN recognition
Separator variations for MRN values
Patient ID detection and replacement
Health plan / insurance ID detection and replacement
Deterministic surrogate behavior with the same salt
Different surrogate behavior with a different salt
Email redaction
Phone redaction
Medical term preservation
No-PHI text success
Empty text rejection
Safe entity summary shape
Overlapping email handling
Text ingestion through /api/v1/ingest
MIME mismatch routing by .txt extension
Unsupported encoding rejection
Large text upload rejection
Non-text modalities still returning placeholders

Test Results

I ran the focused Week 2 tests.

Text anonymization service tests

Command:

py -3.11 -m pytest tests/test_text_anonymization.py

Result:

14 passed

Ingestion endpoint tests

Command:

py -3.11 -m pytest tests/test_ingestion.py

Result:

15 passed

There were a few existing dependency warnings during the ingestion tests, but there were no test failures.

Dependency Notes

No new dependencies were added in this PR.

The required packages were already present in:

python_backend/requirements.txt

Already present:

presidio-analyzer
presidio-anonymizer
spacy

No spaCy model download is required for the new Week 2 text ingestion path.

Current Behavior by Modality

Text

Text now uses real anonymization.

Status:

completed

CSV

CSV still uses the placeholder handler.

Status:

placeholder

DICOM

DICOM still uses the placeholder handler.

Status:

placeholder

NIfTI

NIfTI still uses the placeholder handler.

Status:

placeholder

WSI

WSI still uses the placeholder handler.

Status:

placeholder

Current Limitations

This is the first Week 2 implementation slice, so a few things are intentionally left for later.

Current limitations:

PERSON (name) detection is not enabled in this path, because it needs a reliable spaCy NER model and the service only loads a blank tokenizer.
Date shifting is not implemented yet.
Common date patterns are currently redacted instead of shifted.
The default salt is only a development fallback.
The real production source for study_salt still needs mentor confirmation.
Streaming/chunked anonymization for large text files is not implemented yet.
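For the salt limitation, one possible resolution order is sketched below: an explicit `study_salt` argument first, then an environment variable, and a hard failure otherwise. The variable name `BIO_BLOCK_STUDY_SALT` and the helper `resolve_study_salt` are hypothetical; the real production source is still to be confirmed.

```python
import os
from typing import Mapping, Optional

SALT_ENV_VAR = "BIO_BLOCK_STUDY_SALT"  # hypothetical variable name


def resolve_study_salt(
    study_salt: Optional[str] = None,
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Prefer an explicit salt, then the environment; never fall back silently."""
    salt = study_salt or environ.get(SALT_ENV_VAR)
    if not salt:
        raise RuntimeError(f"Set {SALT_ENV_VAR} or pass study_salt explicitly")
    return salt


salt = resolve_study_salt(environ={SALT_ENV_VAR: "example-only-salt"})
```

Failing loudly when no salt is configured avoids shipping surrogates derived from a well-known development value.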

Privacy Notes

This PR uses only synthetic test examples.

Examples used in tests include fake values such as:

John Doe
MRN: 123456
john.doe@example.com
555-123-4567
PT-1001
ABC123456789

No real PHI was added.

The implementation does not print raw uploaded text or raw detected values.

The entity summary is safe because it only contains counts by entity type.

Please let me know your feedback. If any changes are required, I will push more commits as I improve this further. Thank you so much :)

karthiksathishjeemain

Thanks for the PR, Suryansh. I like the approach of handling PHI in two different ways (pseudonymization and redaction). However, there is a security bug, which I have flagged in a review comment. Please fix it, and I will approve the PR once that is done.

karthiksathishjeemain · 2026-06-08T17:43:16Z

+from typing import Any, Dict, Iterable, Iterator, List, Tuple
+
+SUPPORTED_PROFILES = {"strict", "research"}
+DEFAULT_STUDY_SALT = "bio-block-week2-development-salt"


Please keep DEFAULT_STUDY_SALT in a secrets file such as .env. Even though SHA-256 is irreversible, an attacker who knows the salt can still recover the original value by hashing every candidate and comparing the results. A patient ID is a low-entropy field (it usually ranges from 1 to 10^6), so generating that many hashes is easy once the salt is known.
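To make the concern concrete, here is a minimal sketch of the enumeration attack. It assumes a `salt:value` hashing scheme truncated to 8 hex characters, which is not confirmed by the diff, and the helper names `surrogate` and `brute_force` are hypothetical.

```python
import hashlib
from typing import List


def surrogate(value: str, salt: str) -> str:
    # Assumed scheme: SHA-256 over "salt:value", first 8 hex characters.
    return hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()[:8].upper()


def brute_force(target: str, salt: str, max_id: int = 10**6) -> List[str]:
    """Return every numeric ID in [1, max_id] whose surrogate equals the target."""
    return [
        str(candidate)
        for candidate in range(1, max_id + 1)
        if surrogate(str(candidate), salt) == target
    ]


known_salt = "bio-block-week2-development-salt"  # the value committed in the diff
leaked = surrogate("123456", known_salt)
candidates = brute_force(leaked, known_salt, max_id=200_000)
```

With the salt in hand, the search space is just the ID range, so the original ID is among the returned candidates after a short loop. Keeping the salt secret removes that shortcut.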

XxSURYANSHxX · 2026-06-09

I have implemented the changes, @karthiksathishjeemain. Please take another look and let me know your feedback. Thank you so much :)

karthiksathishjeemain

Thanks for the changes @XxSURYANSHxX. @pradeeban please merge this PR. Good work!!

pradeeban · 2026-06-09T10:53:28Z

Merged. Thanks.

Add Presidio text anonymization scaffold

0afe168

karthiksathishjeemain suggested changes Jun 8, 2026


Move text anonymization salt to environment config

98da5a7

XxSURYANSHxX requested a review from karthiksathishjeemain June 9, 2026 09:29

karthiksathishjeemain approved these changes Jun 9, 2026


pradeeban merged commit 2a84f9d into healthyinc:dev Jun 9, 2026
2 of 3 checks passed

pradeeban merged 2 commits into healthyinc:dev from XxSURYANSHxX:gsoc-week2-presidio-text-anonymization
